Chaos Engineering

What is Chaos Engineering?

Chaos Engineering is a discipline that aims to uncover weaknesses and improve the resilience of software systems by intentionally injecting controlled failures into them.

Robust, resilient IT systems are crucial to data-driven operations. Whether these systems drive internal processes or deliver customer-facing services, the need for reliability and availability remains the same. So, why would you deliberately try to break your services? Chaos engineering does just that – deliberately terminating instances in your production environments.

The key phrase is “in production environments” – why not restrict testing to the dev environment? Using chaos engineering principles, you introduce an important element of randomness into testing and accelerate the identification of single points of failure. System failures are rarely predictable, and the chaos monkey can surface issues that have not previously been considered. If you only ever test for what you think may break, other important issues may be overlooked.

In this way, random outages help to keep testing honest. The testing scripts cannot be skewed, shortened, or cheated, and every fault identified is real – you can literally see the problem and its effects.

In practice, to avoid damage to production environments, chaos engineers often start in a non-production environment and cautiously extend to production in a controlled manner.


The history of Chaos Engineering

Online video streaming service Netflix was one of the first organizations to popularize the concept of Chaos Engineering with their Chaos Monkey engine.

When Netflix began migrating to the cloud in 2010, they found a potential problem with hosted infrastructure – hosts could be terminated and replaced at any moment, potentially affecting quality of service. To ensure a smooth streaming experience, their systems needed to be able to manage these terminations seamlessly.

To assist with testing, Netflix developers created ‘Chaos Monkey’. This application runs in the background of Netflix operations, terminating production services randomly.

The Netflix Chaos Monkey is perhaps the best-known example of chaos engineering, and as cloud services mature, the methodology continues to gain in popularity. Netflix themselves have since introduced “The Simian Army”, a suite of open-source cloud testing tools that enables developers to test the reliability, resilience, recoverability, and security of their cloud services. The army has seen Chaos Monkey joined by a host of ape- and primate-like friends such as Chaos Gorilla, Janitor Monkey, 10-18 Monkey (short for Localization-Internationalization, or l10n-i18n), Doctor Monkey, Latency Monkey and Conformity Monkey.

A fuller history of Chaos Engineering is covered on the Wikipedia page: Chaos engineering - Wikipedia.


How does Chaos Engineering work?

Chaos Engineering is applied as a structured, experimental process rather than as ad-hoc breakage. The general steps involved are as follows:

  • Define the system's steady state: Start by understanding how the system should behave under normal conditions. Identify the key performance indicators (KPIs) and metrics that define the system's expected steady state.
  • Formulate a hypothesis: Formulate a hypothesis about a potential weakness or vulnerability in the system. This could be related to a specific component, a dependency, or a particular failure scenario you want to test.
  • Design experiments: Based on the hypothesis, design controlled experiments to simulate specific failure scenarios. These scenarios could involve introducing network latency, simulating CPU spikes, or terminating specific services, among other possibilities.
  • Implement and execute experiments: Implement the necessary changes or configurations to inject the desired failures into the system. Execute the experiments in a controlled and isolated environment, such as a staging or testing environment, to minimize the impact on production systems.
  • Monitor and observe: During the experiments, monitor the system's behavior, including metrics, logs, and user experience. This step is crucial to understand how the system responds to the injected failures and whether it maintains its expected behavior or exhibits any weaknesses or unexpected behavior.
  • Analyze the results: Analyze the data collected during the chaos experiments to evaluate the system's behavior and resilience. Determine if the system met the expected outcomes and if any weaknesses or vulnerabilities were exposed. This analysis helps identify areas for improvement and informs the next steps.
  • Iterate and improve: Based on the findings from the chaos experiments, iterate on the system's design, architecture, or configuration to address any identified weaknesses. Implement changes and repeat the process to validate the improvements and ensure the system's resilience is enhanced. This process can involve tuning monitoring and observability tools to better understand system behavior.

The goal of Chaos Engineering is not to cause chaos indefinitely but to create controlled chaos to learn about the system's behavior under stressful conditions and make it more resilient. By deliberately injecting failures and observing the system's response, teams can identify and fix weaknesses before they manifest as significant issues in production environments, ultimately improving system reliability and customer experience. The quality of the observability and monitoring tools in place to observe and understand the system's response is essential to success.
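
To make the steps above concrete, the minimal Python sketch below shows the shape of a single controlled experiment: verify the steady state, inject a fault, observe, roll back, and compare the result against the hypothesis. The helper functions (measure_error_rate, inject_network_latency, remove_network_latency) are hypothetical placeholders for whichever monitoring platform and fault-injection tooling an organization actually uses.

```python
import time

# Hypothetical helpers: replace with calls to your real monitoring platform
# and fault-injection tooling.
def measure_error_rate() -> float:
    # Placeholder: query your monitoring/observability platform here.
    return 0.002

def inject_network_latency(ms: int) -> None:
    # Placeholder: ask your fault-injection tooling to add latency to a dependency.
    print(f"injecting {ms} ms of latency")

def remove_network_latency() -> None:
    # Placeholder: roll back the injected latency.
    print("removing injected latency")

# Hypothesis: the error rate stays below 1% even with 200 ms of added latency.
STEADY_STATE_MAX_ERROR_RATE = 0.01

def run_experiment() -> bool:
    baseline = measure_error_rate()
    if baseline > STEADY_STATE_MAX_ERROR_RATE:
        raise RuntimeError("System is not in its steady state; aborting the experiment.")

    inject_network_latency(ms=200)       # implement and execute the experiment
    try:
        time.sleep(300)                  # let the fault run while users or synthetic probes exercise the system
        observed = measure_error_rate()  # monitor and observe
    finally:
        remove_network_latency()         # always roll back, even if something goes wrong

    hypothesis_held = observed <= STEADY_STATE_MAX_ERROR_RATE  # analyze the results
    print(f"baseline={baseline:.4f} observed={observed:.4f} hypothesis held: {hypothesis_held}")
    return hypothesis_held

if __name__ == "__main__":
    run_experiment()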


Chaos Engineering tools

There are several popular tools used for Chaos Engineering, some commercial and some open-source, each with its own features and capabilities. Here are some frequently used tools:

  • Chaos Monkey: Developed by Netflix, Chaos Monkey is one of the earliest and most well-known tools for Chaos Engineering. It randomly terminates virtual machine instances within an infrastructure to simulate failures and test the system's ability to handle them.
  • Chaos Toolkit: An open-source tool, Chaos Toolkit provides a framework for running Chaos Engineering experiments. It offers a wide range of drivers and probes to inject various types of failures and gather metrics during experiments. Chaos Toolkit supports multiple platforms and is highly extensible (a sketch of an experiment definition follows this list).
  • Gremlin: Gremlin is a commercial tool that offers a comprehensive set of Chaos Engineering capabilities. It provides a user-friendly interface to create and execute various failure scenarios, including network disruptions, CPU spikes, and resource exhaustion. Gremlin supports both on-premises and cloud-based systems and includes native integrations for Kubernetes, AWS, Azure, Google Cloud, and even bare metal infrastructure.
  • Chaos Mesh: Chaos Mesh is an open-source Chaos Engineering platform tailored for Kubernetes. It enables users to orchestrate and schedule chaos experiments within a Kubernetes cluster. Chaos Mesh supports a wide range of failure injection techniques, such as pod failures, network delays, and even kernel-level chaos. Its web interface “The Chaos Dashboard” is popular with users.
  • Litmus: Litmus is another open-source Chaos Engineering platform built for Kubernetes. It provides a rich set of predefined chaos experiments, including pod failures, network latency, and disk I/O errors. Litmus also offers observability features to monitor the system during chaos experiments. Many commercial tools, such as Harness, are built upon Litmus.
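
To give a flavor of how such tools describe experiments declaratively, the sketch below builds a Chaos Toolkit-style experiment as a Python dictionary and writes it out as JSON. The overall shape (a steady-state hypothesis made of probes, plus a method of actions) follows the Chaos Toolkit format, but the health-check URL, the terminate-one-instance.sh script and the field values are illustrative assumptions that should be checked against the current Chaos Toolkit documentation.

```python
import json

# Illustrative Chaos Toolkit-style experiment definition. The endpoint, script
# and values below are placeholders, not a tested configuration.
experiment = {
    "version": "1.0.0",
    "title": "Checkout stays available when one instance is terminated",
    "description": "Verify the checkout service tolerates the loss of a single instance.",
    "steady-state-hypothesis": {
        "title": "Checkout health endpoint responds with 200",
        "probes": [
            {
                "type": "probe",
                "name": "checkout-responds",
                "tolerance": 200,
                "provider": {
                    "type": "http",
                    "url": "https://checkout.example.internal/health",  # hypothetical endpoint
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "terminate-one-instance",
            "provider": {
                "type": "process",
                "path": "./terminate-one-instance.sh",  # hypothetical fault-injection script
            },
        }
    ],
    "rollbacks": [],
}

with open("experiment.json", "w") as handle:
    json.dump(experiment, handle, indent=2)

# The experiment could then be executed with the Chaos Toolkit CLI, e.g.: chaos run experiment.json
```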

A far broader community-curated list of tools, alongside a multitude of other Chaos Engineering resources, is available from GitHub - dastergon/awesome-chaos-engineering: A curated list of Chaos Engineering resources.

Developer and ITOps tools and methodologies that may be of use within a Chaos Engineering strategy or mindset are detailed in IT Infrastructure Management – Tools and Strategies (eginnovations.com). Whilst most are not dedicated chaos tools, many are useful for kicking the tires of deployments (WAN emulation, fuzzing, load testing and other methodologies are discussed).


Chaos Engineering in the cloud

Increasingly, cloud providers are recognizing the demand for Chaos Engineering from organizations running applications and services upon their cloud infrastructure, and are providing services to support such workflows. For example, AWS Fault Injection Simulator, released in March 2021, is a fully managed service for running fault injection experiments on AWS that makes it easier to improve an application’s performance, observability, and resiliency. Fault injection experiments are used in chaos engineering, which is the practice of stressing an application in testing or production environments by creating disruptive events, such as a sudden increase in CPU or memory consumption, observing how the system responds, and implementing improvements. Beyond the managed service, AWS engineering teams have also released open-source tools.
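
As a rough sketch of how such an experiment might be started programmatically, the snippet below uses the boto3 fis client to start an experiment from a pre-built AWS Fault Injection Simulator experiment template and wait for it to finish. The template ID is a placeholder, and the exact API parameters and response fields should be confirmed against the current AWS SDK documentation.

```python
import time
import boto3

# Sketch only: assumes an experiment template has already been created in the AWS
# console or via infrastructure-as-code; the template ID below is a placeholder.
fis = boto3.client("fis")

response = fis.start_experiment(
    experimentTemplateId="EXT_PLACEHOLDER_ID",
    tags={"purpose": "chaos-engineering-demo"},
)
experiment_id = response["experiment"]["id"]

# Poll until the experiment reaches a terminal state, then report it.
while True:
    status = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
    if status in ("completed", "stopped", "failed"):
        break
    time.sleep(30)

print(f"Experiment {experiment_id} finished with status: {status}")
```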

Microsoft Azure has a similar offering to AWS’s (in preview as of July 2023) - Azure Chaos Studio - Chaos engineering experimentation | Microsoft Azure. The availability of such tools and services is becoming a factor when choosing between different clouds or virtualization environments.


What are the benefits of Chaos Engineering?

Conducting tests on production systems is inherently high risk. Organizations usually need a relatively robust and mature platform before they can consider unleashing a tool such as the Chaos Monkey. However, testing in production also brings some distinct benefits:

  • Real World Testing: Avoids the overheads of trying to replicate your entire production environment for testing, which helps to reduce costs. It is also almost impossible to properly simulate effects at scale in a development environment.
  • High Priority and Urgency: If problems are found there is an added incentive to address issues quickly. Any breakages caused by the Chaos Engineering process will need to be fixed as fast as possible to maintain an adequate level of service for customers. It’s also worth remembering that building fixes in the production environment will dramatically reduce time to deployment.

Other benefits cited are usually longer-term and assume that Chaos Engineering is implemented well enough to identify enough serious faults to justify any problems it might inadvertently cause. These benefits include:

  • Identifying weaknesses: Chaos Engineering helps teams discover vulnerabilities and weaknesses in their systems that might not be evident during regular testing. By intentionally breaking things, engineers can gain valuable insights into how the system responds to failure and improve its overall reliability.
  • Improved system resilience: By subjecting a system to controlled failures, teams can make necessary improvements to enhance the system's resilience. This leads to a more robust architecture that can better handle unexpected situations, such as network outages or hardware failures.
  • Cost-effective testing: Chaos Engineering can be a cost-effective way to identify potential problems in a system before they manifest in real-world scenarios. It allows teams to test different failure scenarios without incurring the costs associated with actual outages.
  • Faster incident response: Chaos Engineering provides teams with a deeper understanding of their system's behavior under stress, enabling them to respond more effectively to real incidents when they occur.
  • Improved customer experience: A more resilient system translates to improved uptime and a better overall experience for end-users, leading to increased customer satisfaction.

What are the challenges of Chaos Engineering?

Whilst there are many pros to using Chaos Engineering, there are also a number of cons, or at least challenges, including:

  • Complexity and risk: Injecting chaos into a production system requires careful planning and execution to avoid causing actual harm. If not done properly, Chaos Engineering could lead to unintended negative consequences, impacting users, data, or the business. Remember that Chaos Engineering involves performing experiments upon _production_ systems – if it goes wrong, it can potentially go _horribly_ wrong.
  • Requires expertise: Chaos Engineering involves deep knowledge of the system's architecture and the potential impacts of failures. Teams need to have skilled engineers who understand the complexities of the system to carry out effective chaos experiments.
  • Time-consuming: Implementing Chaos Engineering practices can be time-consuming, especially in large and complex systems. Teams need to invest time and resources in designing, executing, and analyzing experiments.
  • False sense of security: Relying solely on Chaos Engineering might create a false sense of security. While it helps identify weaknesses, it cannot catch all possible failure scenarios, and other traditional testing methods are still necessary.
  • Organizational resistance: Some organizations may be resistant to the idea of intentionally causing failures in their systems, fearing potential disruptions to critical services or operations. If Chaos Engineering leads to real-world (albeit unintended) impacts on business systems or customers, there can be considerable finger-pointing and discussion within an organization. Some arguments around this are discussed in a blog by Charity Majors, CEO of Honeycomb: Testing in production: Yes, you can (and should) | Opensource.com.

Whilst Chaos Engineering can be a valuable tool for improving system resilience and identifying weaknesses, it should be approached with caution and proper expertise. When executed effectively, it can significantly enhance a system's ability to handle failures and improve the overall customer experience. If an organization does not have the skills or maturity to implement Chaos Engineering well, it may not be an appropriate methodology for that organization to adopt. Cloud consultant Mathias Lafeldt’s much-circulated blog is a good starting point for making this decision, see: The Limitations of Chaos Engineering | Mathias Lafeldt (sharpend.io).


Examples of Chaos Engineering

Beyond the well-known Netflix / Chaos Monkey implementation there are many other examples of Chaos Engineering tools and practices in use.

Some good articles covering real world use of Chaos Engineering include:


How AIOps-powered monitoring increases the value of Chaos Engineering

eG Enterprise offers extensive application and platform monitoring capabilities, allowing you to assess current system health – and the effects of every chaos monkey-inspired failure.

Observability is an essential component of Chaos Engineering. To understand the impact of a chaos engineering experiment or test, you first need to understand the steady-state behavior, which means you need a monitoring baseline. Monitoring is generally difficult and complex for highly distributed, possibly auto-scaling, cloud-native systems.

Specialist tools such as eG Enterprise are designed to enable observability and give administrators established baselines of behavior over time. Intelligent AIOps capabilities, leveraging statistical algorithms alongside machine learning, not only provide that baseline but also deliver continual analysis and understanding of metrics, logs, and traces as chaos engineering tests are performed, alerting operators to the impacts of the experiment.
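
As a highly simplified illustration of the idea (real AIOps platforms such as eG Enterprise use far more sophisticated, self-learning models), the sketch below derives a mean and standard deviation for a metric from historical samples and flags samples collected during a chaos experiment that deviate too far from that baseline. The get_metric_samples function is a hypothetical stand-in for a query against whatever monitoring system is in place.

```python
from statistics import mean, stdev

def get_metric_samples(window: str) -> list[float]:
    # Hypothetical stand-in for a query against your monitoring platform,
    # e.g. response time in milliseconds for the requested time window.
    if window == "baseline":
        return [110.0, 115.0, 108.0, 112.0, 119.0, 111.0, 114.0]
    return [118.0, 240.0, 260.0, 255.0, 121.0]  # samples taken while the experiment ran

baseline = get_metric_samples("baseline")
mu, sigma = mean(baseline), stdev(baseline)

# Flag any experiment-window sample more than three standard deviations from the baseline mean.
anomalies = [s for s in get_metric_samples("experiment") if abs(s - mu) > 3 * sigma]

print(f"baseline mean={mu:.1f} ms, stdev={sigma:.1f} ms")
print(f"{len(anomalies)} anomalous samples during the experiment: {anomalies}")
```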

Steve Upton’s blog covers how tools with root-cause diagnostic capabilities are essential for effective Chaos Engineering testing, see: Four chaos engineering mistakes to avoid | Thoughtworks.

As AWS’s blog discusses here: Verify the resilience of your workloads using Chaos Engineering | AWS Architecture Blog (amazon.com), one useful monitoring feature to leverage within a chaos engineering strategy is synthetic monitoring, which allows organizations to use robot users to test modifications rather than relying on real users encountering issues first, with all the “chaos” that can bring.
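
A synthetic probe can be as simple as a scripted “robot user” that periodically exercises a key endpoint and records latency and errors while a chaos experiment runs. The sketch below uses Python’s requests library against a placeholder URL; a real synthetic monitoring service would drive full user journeys and feed results into the monitoring platform.

```python
import time
import requests

URL = "https://example.com/checkout/health"   # placeholder endpoint for the robot user to exercise
INTERVAL_SECONDS = 30

def probe_once() -> None:
    start = time.monotonic()
    try:
        response = requests.get(URL, timeout=5)
        latency_ms = (time.monotonic() - start) * 1000
        print(f"status={response.status_code} latency={latency_ms:.0f} ms")
    except requests.RequestException as exc:
        # A failed probe while a chaos experiment runs is exactly the signal we want to capture.
        print(f"probe failed: {exc}")

if __name__ == "__main__":
    while True:
        probe_once()
        time.sleep(INTERVAL_SECONDS)
```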